Consider a table join where only a small percentage of the rows in one of the tables is required to satisfy the join.
For example, consider a data warehouse for a large organisation whose customer
table contains 1 million customers. Any customer-based report generated from this warehouse would need to join an even larger fact table to the customer table, but given that most such reports would be reporting on a reduced set of fact data (eg. for a single region, or date), it is unlikely that any one report would need to read a large proportion of the customer
table.
With a sparse join involving two large tables, the two most common join methods provide sub-optimal performance:
An Indexed Nested Loops join would identify candidate rows from the fact table, and then lookup the customer table for each one. Whilst efficient for small data volumes, Indexed Nested Loops is not scalable to larger volumes and is unsuitable for reports and batch jobs.
A Hash join requires that the smaller table be hashed and then probed by the rows from the larger table. Whichever table is hashed, a full table scan of the sparse table (customer
in the example above) is still required. This seems unwarranted since only a small percentage of its rows is required.
Sparse joins fall into two distinct categories, each with a separate solution:
Where the reduced set of rows in the sparse table is determined by a column in the sparse table. eg. Customers in a particular region.
If such joins are common, it may be appropriate to partition the sparse table. For example, creating List Partitions on customer.region
would reduce the full table scan to a full scan of a single partition when joining only customers of a given region.
Where the reduced set of rows in the sparse table is determined by a column in the other table. eg. All customers who placed an order in September.
If appropriate, the sparse table can be Hash Clustered on the primary key. The join would use Nested Loops with the Hash Clustered table as the outer (second) table, but would escape the performance problems of the Indexed Nested Loops join by avoiding an index altogether; rows from the sparse table can be accessed directly by their Hash Key.
Both of these solutions involve radical changes to the structure of the sparse table. They are not so much a solution to a tuning problem as a design alternative that should be implemented when a system is first being built. Most importantly, no table should be either clustered or partitioned by a developer; in a live system this should only be performed by an experienced DBA after full regression testing in a test database.